ABSTRACT


The  consensus  problem  involves  an  asynchronous  system  of  processes,  some  of  which  may  be 
unreliable.  The  problem  is  for  the  reliable  processes  to  agree  on  a  binary  value.  We  show  that 
every  protocol  for  this  problem  has  the  possibility  of  nontermination,  even  with  only  one  faulty 
process. By way of contrast, solutions are known for the synchronous case, the “Byzantine
Generals” problem.


1.  Introduction 

The  problem  of  reaching  agreement  among  remote  processes  is  one  of  the  most  fundamental 
problems in distributed computing. It is at the core of many algorithms for distributed data
processing, distributed file management, and fault-tolerant distributed applications.

A  well-known  form  of  the  problem  is  the  “transaction  commit  problem”  which  arises  in 
distributed  database  systems  [DS1,  G,  LS,  La,  Le,  Li,  R,  RLS,  S,  SS].  The  problem  is  for  all  the 
data  manager  processes  which  have  participated  in  the  processing  of  a  particular  transaction  to 
agree  on  whether  to  install  the  transaction’s  results  in  the  database  or  to  discard  them.  The 
latter  action  might  be  necessary,  for  example,  if  some  data  managers  were  for  any  reason  unable 
to carry out the required transaction processing. Whatever decision is made, all data managers
must  make  the  same  decision  in  order  to  preserve  the  consistency  of  the  database. 

Reaching  the  type  of  agreement  needed  for  the  “commit”  problem  is  straightforward  if  the 
participating  processes  and  the  network  are  completely  reliable.  However,  real  systems  are 
subject  to  a  number  of  possible  faults  such  as  process  crashes,  network  partitioning,  and  lost, 
distorted  or  duplicated  messages.  One  can  even  consider  more  Byzantine  types  of  failure  [DS2, 
DLM,  DFFLS,  FL,  LFF,  LSP,  PSL]  in  which  faulty  processes  might  go  completely  haywire, 
perhaps  even  sending  messages  according  to  some  malevolent  plan.  One  therefore  wants  an 
agreement  protocol  which  is  as  reliable  as  possible  in  the  presence  of  such  faults.  Of  course,  any 
protocol  can  be  overwhelmed  by  faults  that  are  too  frequent  or  too  severe,  so  the  best  that  one 
can  hope  for  is  a  protocol  which  is  tolerant  to  a  prescribed  number  of  “expected”  faults. 

In  this  paper,  we  show  the  surprising  result  that  no  completely  asynchronous  consensus 
protocol  can  tolerate  even  a  single  unannounced  process  death.  We  do  not  consider  Byzantine 
failures,  and  we  assume  that  the  message  system  is  reliable  —  it  delivers  all  messages  correctly 
and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at
an inopportune time can cause any distributed commit protocol to fail to reach agreement.
Thus,  this  important  problem  has  no  robust  solution  without  further  assumptions  about  the 
computing  environment  or  still  greater  restrictions  on  the  kind  of  failures  to  be  tolerated! 

Crucial  to  our  proof  is  that  processing  is  completely  asynchronous,  that  is,  we  make  no 
assumptions  about  the  relative  speeds  of  processes  nor  about  the  delay  time  in  delivering  a 
message.  We  also  assume  that  processes  do  not  have  access  to  synchronised  clocks,  so  algorithms 
based on timeouts, for example, cannot be used. (In particular, the solutions in [DS1] are not
applicable.) Finally, we do not postulate the ability to detect the death of a process, so it is
impossible for one process to tell whether another has died (stopped entirely) or is just running
very  slowly. 

Our  impossibility  result  applies  to  even  a  very  weak  form  of  the  consensus  problem.  Assume 
every process starts with an initial value in {0, 1}. A nonfaulty process decides on a value in
{0, 1} by entering an appropriate decision state. All nonfaulty processes which decide are
required  to  choose  the  same  value.  For  the  purpose  of  the  impossibility  proof,  we  require  only 
that  some  process  eventually  make  a  decision.  (Of  course,  any  algorithm  of  interest  would 
require  that  all  nonfaulty  processes  make  a  decision.)  The  trivial  solution  in  which,  say,  0  is 
always  chosen  is  ruled  out  by  stipulating  that  both  0  and  1  are  possible  decision  values,  although 
perhaps  for  different  initial  configurations. 

Our  system  model  is  rather  strong  so  as  to  make  our  impossibility  proof  as  widely  applicable 
as  possible.  Processes  are  modelled  as  automata  (with  possibly  infinitely  many  states)  which 
communicate  by  means  of  messages.  In  one  atomic  step,  a  process  can  attempt  to  receive  a 
message,  perform  local  computation  based  on  whether  or  not  a  message  was  delivered  to  it  and  if 
so  on  which  one,  and  send  an  arbitrary  but  finite  set  of  messages  to  other  processes.  In 
particular,  an  “atomic  broadcast”  capability  is  assumed,  so  a  process  can  send  the  same  message 
in  one  step  to  all  other  processes  with  the  knowledge  that  if  any  nonfaulty  process  receives  the 
message,  then  all  the  nonfaulty  processes  will.  Every  message  is  eventually  delivered  as  long  as 
the  destination  process  makes  infinitely  many  attempts  to  receive,  but  messages  can  be  delayed 
arbitrarily  long  and  delivered  out  of  order. 

The  asynchronous  commit  protocols  in  current  use  all  seem  to  have  a  “window  of 
vulnerability”  —  an  interval  of  time  during  the  execution  of  the  algorithm  in  which  the  delay  or 
inaccessibility  of  a  single  process  can  cause  the  entire  algorithm  to  wait  indefinitely.  It  follows 
from  our  impossibility  result  that  every  commit  protocol  has  such  a  “window”,  confirming  a 
widely-believed  tenet  in  the  folklore. 


2. Consensus Protocols

A consensus protocol P is an asynchronous system of N processes (N ≥ 2). Each process p
has a one-bit input register x_p, an output register y_p with values in {b, 0, 1}, and an unbounded
amount of internal storage. The values in the input and output registers together with the
program counter and internal storage comprise the internal state. Initial states prescribe fixed
starting values for all but the input register; in particular, the output register starts with value b.
The states in which the output register has value 0 or 1 are distinguished as being decision
states. p acts deterministically according to a transition function. The transition function
cannot change the value of the output register once the process has reached a decision state; that
is, the output register is “write-once”. The entire system P is specified by the transition
functions  associated  with  each  of  the  processes  and  the  initial  values  of  the  input  registers. 

Processes  communicate  by  sending  each  other  messages.  A  message  is  a  pair  (p,  m),  where  p 
is  the  name  of  the  destination  process  and  m  is  a  “message  value”  from  a  fixed  universe  M.  The 
message  system  maintains  a  multiset,  called  the  message  buffer,  of  messages  that  have  been  sent 
but not yet delivered. It supports two abstract operations:

send(p,  m):  places  (p,  m)  in  the  message  buffer; 

receive(p): deletes some message (p, m) from the buffer and returns m, in which case we
say (p, m) is delivered, or returns the special null marker ∅ and leaves the
buffer unchanged.

Thus,  the  message  system  acts  nondeterministically,  subject  only  to  the  condition  that  if 
receive(p)  is  performed  infinitely  many  times,  then  every  message  (p,  m)  in  the  message  buffer  is 
eventually delivered. In particular, the message system is allowed to return ∅ a finite number of
times  in  response  to  receive(p)  even  though  a  message  (p,  m)  is  present  in  the  buffer. 
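
The two operations can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `MessageBuffer`, the use of `None` for the null marker, and the seeded random generator standing in for nondeterminism are all hypothetical choices, not part of the model. The sketch also does not enforce the fairness condition; a random choice delivers every message only with probability 1.

```python
import random
from collections import Counter

NULL = None  # assumption: None plays the role of the null marker


class MessageBuffer:
    """Multiset of (destination, value) pairs sent but not yet delivered."""

    def __init__(self, seed=0):
        self.pending = Counter()
        # Hypothetical stand-in for the nondeterminism of the message system.
        self.rng = random.Random(seed)

    def send(self, p, m):
        # Place (p, m) in the buffer; duplicates are kept (multiset).
        self.pending[(p, m)] += 1

    def receive(self, p):
        # Either deliver some pending message addressed to p, or return NULL
        # and leave the buffer unchanged, even if a message for p is present.
        candidates = [msg for msg in self.pending if msg[0] == p]
        if not candidates or self.rng.random() < 0.5:
            return NULL
        msg = self.rng.choice(candidates)
        self.pending[msg] -= 1
        if self.pending[msg] == 0:
            del self.pending[msg]
        return msg[1]
```

A caller that repeatedly invokes `receive(p)` sees a mixture of null results and delivered values, which is the behaviour the definition allows.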

A  configuration  of  the  system  consists  of  the  internal  state  of  each  process  together  with  the 
contents  of  the  message  buffer.  An  initial  configuration  is  one  in  which  each  process  starts  at 
an  initial  state  and  the  message  buffer  is  empty. 

A  step  takes  one  configuration  to  another  and  consists  of  a  primitive  step  by  a  single  process 
p.  Let  C  be  a  configuration.  The  step  occurs  in  two  phases.  First,  receive(p)  is  performed  on 
the message buffer in C to obtain a value m ∈ M ∪ {∅}. Then, depending on p's internal state
in  C  and  on  m,  p  enters  a  new  internal  state  and  sends  a  finite  set  of  messages  to  other  processes. 
Since  processes  are  deterministic,  the  step  is  completely  determined  by  the  pair  e  =  (p,  m),  which 
we  call  an  event.  (This  “event”  should  be  thought  of  as  the  receipt  of  m  by  p.)  e(C)  denotes  the 
resulting configuration and we say that e can be applied to C. Note that the event (p, ∅) can
always  be  applied  to  C,  so  it  is  always  possible  for  a  process  to  take  another  step. 


A schedule from C is a finite or infinite sequence σ of events which can be applied, in turn,
starting from C. The associated sequence of steps is called a run. If σ is finite, we let σ(C)
denote  the  resulting  configuration,  which  is  said  to  be  reachable  from  C.  A  configuration 
reachable  from  some  initial  configuration  is  said  to  be  accessible.  Hereafter,  all  configurations 
mentioned  are  assumed  to  be  accessible. 

The  following  lemma  expresses  a  “commutativity”  property  of  schedules. 

Lemma 1. Suppose that from some configuration C the schedules σ_1, σ_2 lead to
configurations C_1, C_2 respectively. If the sets of processes taking steps in σ_1 and σ_2
respectively are disjoint, then σ_2 can be applied to C_1 and σ_1 can be applied to C_2, and
both lead to the same configuration C_3. (See Figure 1.)


Figure  1. 


Proof. The result follows at once from the system definition since σ_1 and σ_2 do not interact.

□ 
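
The commutativity property can be checked on a toy instance. The sketch below models a configuration as a pair (process states, message multiset) and an event as a pair (p, m). The transition function is a hypothetical example invented for illustration (each process counts its steps and forwards a received value plus one to the next process); nothing about it comes from the paper.

```python
from collections import Counter

NULL = None  # assumption: None plays the role of the null marker


def transition(p, state, m):
    """Hypothetical deterministic transition: returns (new_state, sent messages)."""
    steps, last = state
    if m is NULL:
        return (steps + 1, last), []
    # On receipt of m, remember it and forward m + 1 to process (p + 1) mod 3.
    return (steps + 1, m), [((p + 1) % 3, m + 1)]


def apply_event(config, event):
    """Return e(C) for e = (p, m); raise ValueError if e cannot be applied to C."""
    states, buffer = config
    p, m = event
    buffer = Counter(buffer)  # copy, so the input configuration is not mutated
    if m is not NULL:
        if buffer[(p, m)] == 0:
            raise ValueError(f"event {event} is not applicable")
        buffer[(p, m)] -= 1
    new_state, sent = transition(p, states[p], m)
    for msg in sent:
        buffer[msg] += 1
    new_states = states[:p] + (new_state,) + states[p + 1:]
    return new_states, +buffer  # unary plus drops entries with zero count


def apply_schedule(config, schedule):
    for event in schedule:
        config = apply_event(config, event)
    return config


C = (((0, NULL), (0, NULL), (0, NULL)), Counter({(0, 10): 1, (1, 20): 1}))
sigma1 = [(0, 10), (0, NULL)]  # only process 0 takes steps
sigma2 = [(1, 20)]             # only process 1 takes steps
C3_a = apply_schedule(apply_schedule(C, sigma1), sigma2)
C3_b = apply_schedule(apply_schedule(C, sigma2), sigma1)
print(C3_a == C3_b)  # True: both orders reach the same configuration
```

Because the two schedules touch disjoint processes, neither consumes a message the other needs nor changes a state the other reads, so the order of application does not matter.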


A configuration C has decision value v if some process p is in a decision state with y_p = v.
A consensus protocol is partially correct if it satisfies two conditions:

1. No accessible configuration has more than one decision value.

2. For each v ∈ {0, 1}, some accessible configuration has decision value v.

A process p is nonfaulty in a run provided it takes infinitely many steps, and is faulty
otherwise.  A  run  is  admissible  provided  at  most  one  process  is  faulty,  and  provided  all  messages 
sent  to  nonfaulty  processes  are  eventually  received. 

A  run  is  a  deciding  run  provided  some  process  reaches  a  decision  state  in  that  run.  A 
consensus  protocol  P  is  totally  correct  in  spite  of  one  fault  if  it  is  partially  correct,  and  every 
admissible  run  is  a  deciding  run.  Our  main  theorem  shows  that  every  partially  correct  protocol 
for the consensus problem has some admissible run which is not a deciding run.

3.  Main  Result 

Theorem 1. No consensus protocol is totally correct in spite of one fault.

Proof.  Assume  to  the  contrary  that  P  is  a  consensus  protocol  which  is  totally  correct  in  spite 
of  one  fault.  We  prove  a  sequence  of  lemmas  which  eventually  lead  to  a  contradiction. 

The  basic  idea  is  to  show  circumstances  under  which  the  protocol  remains  forever  indecisive. 
This  involves  two  steps.  First,  we  argue  that  there  is  some  initial  configuration  in  which  the 
decision  is  not  already  predetermined.  Secondly,  we  construct  an  admissible  run  which  avoids 
ever  taking  a  step  that  would  commit  the  system  to  a  particular  decision. 

Let  C  be  a  configuration  and  let  V  be  the  set  of  decision  values  of  configurations  reachable 
from C. C is bivalent if |V| = 2. C is univalent if |V| = 1, let us say 0-valent or 1-valent
according to the corresponding decision value. By the total correctness of P, and the fact that
there are always admissible runs, V ≠ ∅.
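
To make the classification concrete, the following sketch computes V by exhaustive search. It assumes, purely for illustration, that the reachable configurations form a small finite graph given explicitly; in the model of this paper the configuration space is generally infinite, so this is not a procedure that applies to real protocols. The names `successors` and `decided` are hypothetical.

```python
def decision_values(start, successors, decided):
    """Return V: the decision values of all configurations reachable from start."""
    seen, stack, values = {start}, [start], set()
    while stack:
        config = stack.pop()
        if config in decided:
            values.add(decided[config])
        for nxt in successors.get(config, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return values


def classify(start, successors, decided):
    values = decision_values(start, successors, decided)
    if len(values) == 2:
        return "bivalent"
    if len(values) == 1:
        return f"{next(iter(values))}-valent"
    return "undecided"  # V empty; excluded for a totally correct protocol


# Toy graph: A can still go either way, while B and C are already committed.
successors = {"A": ["B", "C"], "B": ["D0"], "C": ["D1"]}
decided = {"D0": 0, "D1": 1}
print(classify("A", successors, decided))  # bivalent
print(classify("B", successors, decided))  # 0-valent
```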

Lemma 2. P has a bivalent initial configuration.

Proof. Assume not. Then P must have both 0-valent and 1-valent initial configurations by
the assumed partial correctness. Let us call two initial configurations adjacent if they differ only
in the initial value x_p of a single process p. Any two initial configurations are joined by a chain
of initial configurations, each adjacent to the next. Hence, there must exist a 0-valent initial
configuration C_0 adjacent to a 1-valent initial configuration C_1. Let p be the process in whose
initial value they differ.

Now consider some admissible deciding run from C_0 in which process p takes no steps, and let
σ be the associated schedule. Then σ can be applied to C_1 also, and corresponding configurations
in the two runs are identical except for the internal state of process p. It is easily shown that
both runs eventually reach the same decision value. If the value is 1, then C_0 is bivalent;
otherwise, C_1 is bivalent. Either case contradicts the assumed nonexistence of a bivalent initial
configuration.  □ 
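
The chain argument in this proof can be illustrated with a short sketch. It builds a chain of input vectors in which consecutive entries differ in one process's initial value, then locates the first adjacent pair whose labels differ. The labelling function passed as `valency` is hypothetical: it stands for the assumption (made only for the sake of contradiction) that every initial configuration is univalent, and the majority rule used in the example is an arbitrary choice.

```python
def adjacency_chain(start, end):
    """Input vectors from start to end; consecutive ones differ in one position."""
    chain = [tuple(start)]
    current = list(start)
    for i, target in enumerate(end):
        if current[i] != target:
            current[i] = target
            chain.append(tuple(current))
    return chain


def find_boundary(chain, valency):
    """First adjacent pair with different labels, plus the process that differs."""
    for a, b in zip(chain, chain[1:]):
        if valency(a) != valency(b):
            p = next(i for i in range(len(a)) if a[i] != b[i])
            return a, b, p
    return None


# Hypothetical labelling: pretend each initial configuration is univalent
# with the value held by a majority of the three inputs.
def majority(inputs):
    return int(sum(inputs) * 2 > len(inputs))


chain = adjacency_chain((0, 0, 0), (1, 1, 1))
print(chain)
print(find_boundary(chain, majority))
```

Since the two ends of the chain carry different labels, some adjacent pair must as well; that pair plays the role of C_0 and C_1 in the proof.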

Lemma 3. Let C be a bivalent configuration of P, and let e = (p, m) be an event
which is applicable to C. Let 𝒞 be the set of configurations reachable from C without
applying e, and let 𝒟 = e(𝒞) = {e(E) | E ∈ 𝒞 and e is applicable to E}. Then 𝒟
contains a bivalent configuration.

Proof. Since e is applicable to C, then by definition of 𝒞 and the fact that messages can be
delayed arbitrarily, e is applicable to every E ∈ 𝒞.

Now assume that 𝒟 contains no bivalent configurations, so every configuration D ∈ 𝒟 is
univalent. We proceed to derive a contradiction.

Let E_i be an i-valent configuration reachable from C, i = 0, 1. (E_i exists since C is bivalent.)
If E_i ∈ 𝒞, let F_i = e(E_i) ∈ 𝒟. Otherwise, e was applied in reaching E_i, and so there exists F_i ∈ 𝒟
from which E_i is reachable. In either case, F_i is i-valent since F_i is not bivalent (by assumption)
and one of E_i and F_i is reachable from the other. Since F_i ∈ 𝒟, i = 0, 1, 𝒟 contains both 0-valent
and 1-valent configurations.

Call two configurations neighbors if one results from the other in a single step. By an easy
induction, there exist neighbors C_0, C_1 ∈ 𝒞 such that D_i = e(C_i) is i-valent, i = 0, 1. Without
loss of generality, C_1 = e'(C_0) where e' = (p', m').

CASE 1: If p' ≠ p, then D_1 = e'(D_0) by Lemma 1. This is impossible since any successor of
a 0-valent configuration is 0-valent. (See Figure 2.)


Figure 2.

CASE 2: If p' = p, then consider any finite deciding run from C_0 in which p takes no steps, with
corresponding schedule σ, and let A = σ(C_0). By Lemma 1, σ is applicable to D_i, and it leads to an
i-valent configuration E_i = σ(D_i), i = 0, 1. Also by Lemma 1, e(A) = E_0 and e(e'(A)) = E_1. (See
Figure 3.) Hence, A is bivalent, which is impossible since the run to A is deciding and so A is
univalent.

In each case, we reached a contradiction, so 𝒟 contains a bivalent configuration. □

Any  deciding  run  from  a  bivalent  initial  configuration  goes  to  a  univalent  configuration,  so 
there  must  be  some  single  step  which  goes  from  a  bivalent  to  a  univalent  configuration.  Such  a 
step  determines  the  eventual  decision  value.  We  now  show  that  it  is  always  possible  to  run  the 
system  in  a  way  that  avoids  such  steps,  leading  to  an  admissible  non-deciding  run. 



Figure  3. 

The  run  is  constructed  in  stages,  starting  from  an  initial  configuration.  We  ensure  that  the 
run  is  admissible  in  the  following  way.  A  queue  of  processes  is  maintained,  initially  in  an 
arbitrary  order,  and  the  message  buffer  in  a  configuration  is  ordered  according  to  the  time  the 
messages  were  sent,  earliest  first.  Each  stage  consists  of  one  or  more  process  steps.  The  stage 
ends  with  the  first  process  in  the  process  queue  taking  a  step  in  which,  if  its  message  queue  was 
not  empty  at  the  start  of  the  stage,  its  earliest  message  is  received.  This  process  is  then  moved 
to  the  back  of  the  process  queue.  In  any  infinite  sequence  of  such  stages  every  process  takes 
infinitely  many  steps  and  receives  every  message  sent  to  it.  The  run  is  therefore  admissible.  Our 
problem  of  course  is  to  do  this  in  such  a  way  as  to  avoid  a  decision  ever  being  reached. 
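
The bookkeeping that makes the run admissible can be sketched separately from the harder question of avoiding a decision. The function below records only the final event of each stage: the process at the head of the queue receives its earliest pending message, or the null marker if it has none, and then moves to the back. It is a simplified illustration under stated assumptions: the intermediate steps of a stage and any messages sent during the run are omitted, and the name `fair_stage_ends` is hypothetical.

```python
from collections import deque


def fair_stage_ends(processes, buffer, num_stages):
    """Return the closing event of each stage.

    buffer is a list of (destination, value) pairs ordered by send time,
    earliest first. None stands for the null marker (an assumption).
    """
    queue = deque(processes)
    buffer = list(buffer)  # work on a copy
    events = []
    for _ in range(num_stages):
        p = queue[0]
        earliest = next((msg for msg in buffer if msg[0] == p), None)
        if earliest is None:
            events.append((p, None))   # nothing pending for p
        else:
            buffer.remove(earliest)    # removes the first, i.e. earliest, match
            events.append(earliest)
        queue.rotate(-1)               # head of the queue moves to the back
    return events


print(fair_stage_ends([0, 1, 2], [(1, "a"), (0, "b"), (1, "c")], 4))
```

Rotating the queue gives every process a turn once every N stages, and always delivering the earliest message ensures no message waits forever.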

Let C_0 be a bivalent initial configuration whose existence is assured by Lemma 2. Execution
begins in C_0, and we ensure that every stage begins from a bivalent configuration. Suppose then
that configuration C is bivalent and that process p heads the process queue. Let m be the
earliest message to p in C's message buffer, if any, and ∅ otherwise. Let e = (p, m). By Lemma
3, there is a bivalent configuration C' reachable from C by a schedule in which e is the last event
applied. The corresponding sequence of steps defines the stage.

Since  each  stage  ends  in  a  bivalent  configuration,  every  stage  in  the  construction  of  the 
infinite  schedule  succeeds.  The  resulting  run  is  admissible,  and  no  decision  is  ever  reached.  It 
follows  that  P  is  not  totally  correct.  □ 
4. Initially Dead Processes

In  this  section,  we  exhibit  a  protocol  which  solves  the  consensus  problem  for  N  processes  as 
long  as  a  majority  of  the  processes  are  non-faulty  and  no  process  dies  during  the  execution  of  the 
protocol.  No  process  knows  in  advance,  however,  which  of  the  processes  are  initially  dead  and 
which  are  not. 

The  protocol  works  in  two  stages.  During  the  first  stage,  the  processes  construct  a  directed 
graph  G  with  a  node  corresponding  to  each  process.  Every  process  broadcasts  a  message 
containing its process number and initial value and then listens for messages from L - 1 other
processes, where L = ⌈(N + 1)/2⌉. G has an edge from i to j iff j receives a message from i.
Thus, G has indegree L - 1.

In  the  second  stage,  the  processes  construct  G+,  the  transitive  closure  of  G,  in  the  sense  that 
upon  completion  of  this  stage,  each  process  k  knows  about  all  of  the  edges  (j,  k)  incident  on  k  in 
G+.  Each  k  also  knows  the  initial  values  of  all  such  j.  After  k  discovers  such  an  edge  in  G+,  we 
say  that  k  knows  about  that  edge  and  about  the  node  j. 

The  computation  of  G+  is  carried  out  in  the  following  way.  First,  each  process  broadcasts  to 
all other processes its process number and initial value together with the names of the L - 1
processes it heard from during the first stage. It then waits until it has received both the stage 1
and stage 2 messages from all its ancestors in G which it knows about. It initially knows only
about the L - 1 processes from which it heard directly during the first stage, but as it receives
stage 2 messages, it may discover additional ancestors. Waiting continues until such time as all
processes it currently knows about have been heard from.

At  this  point,  each  process  knows  all  of  its  own  ancestors  and  the  edges  of  G  incident  on 
them,  so  it  can  compute  all  of  the  edges  of  G+  incident  on  each  of  its  ancestors.  This  enables  it 
to  determine  which  of  its  ancestors  belong  to  an  initial  clique  of  G+,  that  is,  a  clique  with  no 
incoming  edges,  for  node  k  is  in  an  initial  clique  iff  k  is  itself  an  ancestor  of  every  one  of  its 
ancestors. Since every node in G+ has at least L - 1 predecessors, there can be only one initial
clique,  it  has  cardinality  at  least  L,  and  every  process  which  completes  the  second  stage  knows 
exactly  the  set  of  processes  comprising  it. 

Finally,  each  process  makes  a  decision  based  on  the  initial  values  of  the  processes  in  the  initial 
clique  using  any  agreed-upon  rule.  Since  all  processes  know  the  initial  values  of  all  members  of 
the  initial  clique,  they  all  reach  the  same  decision. 
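
The clique computation can be sketched as follows. This is a centralized illustration that assumes a global view of G, whereas in the protocol each process works only from what it has learned about its own ancestors. The dictionary `heard_from` (mapping each live process to the processes it heard from in stage 1), the example graph, and the choice of "minimum initial value" as the agreed-upon rule are all hypothetical.

```python
def majority_threshold(n):
    """L = ceil((N + 1) / 2), computed with integer arithmetic."""
    return n // 2 + 1


def ancestors(heard_from, k):
    """All nodes with a path to k in G; heard_from[j] holds each i with edge (i, j)."""
    seen, stack = set(), list(heard_from[k])
    while stack:
        i = stack.pop()
        if i not in seen:
            seen.add(i)
            stack.extend(heard_from[i])
    return seen


def initial_clique(heard_from):
    """Nodes k that are themselves an ancestor of every one of their ancestors."""
    anc = {k: ancestors(heard_from, k) for k in heard_from}
    return {k for k in heard_from if all(k in anc[j] for j in anc[k])}


def decide(heard_from, initial_value):
    clique = initial_clique(heard_from)
    # Hypothetical agreed-upon rule: the minimum initial value in the clique.
    return min(initial_value[k] for k in clique)


# Example with N = 5, so L = 3 and every process hears from L - 1 = 2 others.
heard_from = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {0, 1}, 4: {2, 3}}
initial_value = {0: 1, 1: 0, 2: 1, 3: 0, 4: 0}
print(majority_threshold(5))              # 3
print(initial_clique(heard_from))         # {0, 1, 2}
print(decide(heard_from, initial_value))  # 0
```

In this example processes 3 and 4 have ancestors in the clique but are not ancestors of them, so they are excluded; the clique has exactly L members.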

The  correctness  of  this  protocol  proves  the  following  theorem. 

Theorem 2. There is a partially correct consensus protocol in which all nonfaulty
processes always reach a decision, provided no processes die during its execution and a
strict majority of the processes are alive initially.